Protein function prediction with GO - Part 3 #64
base: dev
- migration from deep go format to chebai->go_uniprot format
- +migration structure changes
I have made the suggested changes for migration. Please check.

Config for DeepGO1:
class_path: chebai.preprocessing.datasets.go_uniprot.DeepGO1MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1002
  reader_kwargs: {n_gram: 3}

Config for DeepGO2:
class_path: chebai.preprocessing.datasets.go_uniprot.DeepGO2MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1000
  reader_kwargs: {n_gram: 3}
I ran the migration script for DeepGO2 and tried training a model with the data. I noticed three issues:
@aditya0by0 Can you have a look at that? Edit: I used the following commands:
Run:
I agree that DeepGO2 does not explicitly reject invalid amino acids like "O" and "U", but its data pipeline does not explicitly treat them as valid either.

To elaborate, DeepGO2 includes the amino acid notation "X" in its set of valid amino acids, where "X" represents any/unknown amino acid (as per Wikipedia). In their pipeline, when invalid amino acids such as "U", "O", "B", "Z", "J", or "*" are encountered in a protein sequence, they are effectively mapped to "X". This is evident in their implementation, where the index of "X" is used for any amino acid that doesn't belong to the valid set. You can see this behavior in their code here:

So, while "O" and "U" are not explicitly handled, the use of "X" as a catch-all ensures that any invalid amino acid is safely represented.

If we want to follow this approach, we would need to replace every invalid amino acid in the sequence with "X" as a pre-processing step before tokenization. This would avoid inconsistencies in the n-gram tokenization process. Please let me know how we want to proceed.
Ok, so if DeepGO2 replaces every amino acid that is not explicitly valid with "X", then we should do the same. I think that is the easiest solution.
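For reference, the replacement step discussed above could look roughly like the sketch below. It is not the code from this PR or from DeepGO2: the function names `replace_invalid_amino_acids` and `to_ngrams` are hypothetical, and the valid set (the 20 standard amino acids plus "X") is an assumption that should be checked against the actual DeepGO2 alphabet.

```python
# Assumption: valid residues are the 20 standard amino acids plus "X" (unknown).
VALID_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWYX")


def replace_invalid_amino_acids(sequence: str, unknown: str = "X") -> str:
    """Map every residue outside the valid set to the catch-all symbol."""
    return "".join(aa if aa in VALID_AMINO_ACIDS else unknown for aa in sequence)


def to_ngrams(sequence: str, n: int = 3) -> list[str]:
    """Split a sequence into overlapping n-grams (hypothetical helper)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


# Clean first, then tokenize, so no n-gram contains an invalid residue.
cleaned = replace_invalid_amino_acids("MKUOBZJ*AC")
trigrams = to_ngrams(cleaned, n=3)
print(cleaned)       # MKXXXXXXAC
print(trigrams[:2])  # ['MKX', 'KXX']
```

Running the replacement before tokenization keeps the n-gram vocabulary limited to combinations of valid symbols.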
- modify deepgo2 migration script to migrate the esm2 embeddings too
- modify migration class to use esm2 embeddings or reader features, based on input
- this will help to identify methods that need to be implemented during coding and not during runtime
Commit which added ESM2 embeddings to the migration and training process: e7b3d80

This commit also includes the ESM2 embeddings in the DeepGO2 migration process.

@sfluegel05, please find below the config to train a model on DeepGO2 migrated data with ESM2 embeddings.

class_path: chebai.preprocessing.datasets.deepGO.go_uniprot.DeepGO2MigratedData
init_args:
  go_branch: "MF"
  max_sequence_length: 1000
  use_esm2_embeddings: True
I added the ESM2 config as well as a simple feed-forward network. ESM2 with Electra does not work out of the box, since ESM2 uses real values that may be negative, while Electra expects positive values only. Also, I'm not sure how sensible using Electra here would be: Electra expects a sequence, but ESM2 provides an embedding vector that is not directly related to the sequence (tell me if I'm wrong here).
Yes, you're absolutely right about the compatibility issues between ESM2 and Electra. ESM2 embeddings can have negative values, as they are activation values from a specific layer within the ESM2 network. We could apply ReLU or normalization as a pre-processing step to shift the ESM2 embeddings into the positive domain before feeding them to Electra, but I doubt it would still work, for the reasons you stated.
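To make the two pre-processing options mentioned above concrete, here is a minimal pure-Python sketch. The embedding values are made up for illustration and the function names are hypothetical; nothing here is taken from the PR. Note that both options still produce real-valued vectors rather than a sequence, so they do not address the second concern raised above.

```python
def relu(vec: list[float]) -> list[float]:
    """Clamp negative activations to zero (the negative part is lost)."""
    return [max(0.0, v) for v in vec]


def min_max_scale(vec: list[float]) -> list[float]:
    """Shift and scale values into [0, 1], preserving their relative order."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        # Constant vector: avoid division by zero.
        return [0.0 for _ in vec]
    return [(v - lo) / (hi - lo) for v in vec]


# Made-up stand-in for an ESM2 embedding vector (real ones are much longer).
embedding = [-1.5, 0.0, 0.5, 2.5]

print(relu(embedding))           # [0.0, 0.0, 0.5, 2.5]
print(min_max_scale(embedding))  # [0.0, 0.375, 0.5, 1.0]
```

ReLU collapses all negative activations to zero, whereas min-max scaling keeps the ordering of values but makes them depend on the per-vector minimum and maximum.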
PR for the Issue Protein function prediction with GO #36
Note: The above issue will be implemented in 3 PRs:
Changes to be done in this PR
From comment #36 (comment)